System and method for using graph transduction techniques to make relational classifications on a single connected network

ABSTRACT

A system and method for extending partially labeled data graphs to unlabeled nodes in a single network classification by weighting the data with a weight matrix that uses a modified graph Laplacian based regularization framework and applying graph transduction methods to the weighted data. The technique may be applied to data graphs that are directed or undirected, that may or may not have attributes and that may be homogeneous or heterogeneous.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to techniques for statistical relational learning, and more particularly to techniques for making relational classifications on a single connected network.

2. Background Description

Given the prevalence of large connected relational graphs across diverse domains, single or within network classification has been one of the popular endeavors in statistical relational learning (SRL) research. Ranging from social networking websites to movie databases to citation networks, large connected relational graphs are commonplace. In single network classification, we have a partially labeled data graph and the goal is to extend this labeling, as accurately as possible, to the unlabeled nodes. The nodes themselves may or may not have associated attributes. An example where within network classification could be useful is in forming common interest groups on social networking websites. For instance, a group of people in the same geography may be interested in playing soccer and they would be interested in finding more people who are likely to have the same interest. In a different domain such as entertainment, one might be interested in estimating which of the new movies is likely to make a splash at the box office. Based on the success of other movies that had some of the same actors and/or the same director, one could provide a reasonable estimate of which movies are most likely to be successful.

Many methods that learn and infer over a data graph have been developed in SRL literature. Some of the more effective methods perform collective classification, that is, besides using the attributes of the unlabeled node to infer its label, they also use attributes and labels of related nodes/entities. These are thus a generalization of methods that assume that the data is independently and identically distributed (i.i.d.). Examples of such methods are relational Markov networks (RMNs), relational dependency networks (RDNs), Markov logic networks (MLNs), and probabilistic relational models (PRMs). These all fall under the umbrella of Markov networks. There have been simpler models suggested as baselines, ranging from relational neighbor classifiers (RN), which simply choose the most numerous class label amongst their neighbors, to more involved variants such as those using relaxation labeling. Interestingly, these simple models perform quite well when the auto-correlation is high, even though the graph may be sparsely labeled. Recently, a pseudo-likelihood expectation maximization (PL-EM) method was introduced, which seems to compare favorably with other methods when the graph has a moderate number (around 20-30%) of labeled nodes.

A different class of methods that could potentially address the problem at hand are graph transduction methods, which are a part of semi-supervised learning methods and in some sense are the i.i.d. counterpart of relational methods. These methods typically perform well when we are given a weighted graph and the linked nodes have mostly the same labels (unless a priori dissimilar nodes are explicitly specified), even if only a small fraction of the labels are known. If a weighted graph is not readily available, it is constructed from the (explanatory) attributes of the nodes. If an unweighted graph with no attributes is given, then the adjacency matrix is passed as input.

In relational learning, the graphs are typically unweighted and sometimes may not have attributes. In many cases, the attributes may not accurately predict the labels, in which case, weighting the edges solely on them may not provide acceptable results. The links could be viewed as an additional source of information to determine labels amongst connected nodes. Thus, the weights should also be functions of the known labeling. Some of these intuitions are captured in the relational Gaussian process model, but it is limited to undirected graphs and the suggested kernel function is not easy to adapt to relational settings where we may have heterogeneous data.

SUMMARY OF THE INVENTION

The present invention provides a lucid way to effectively leverage a rich class of graph transduction methods, namely those based on the graph Laplacian regularization framework, to make within network relational classifications. Among the existing graph transduction methods, this class of methods is considered to be one of the most efficient and accurate in real applications. In particular, the invention provides a procedure to learn a weight matrix for a graph that may be directed or undirected, that may exhibit positive or negative auto-correlation and where the edges in the graph may be between labeled nodes, between unlabeled nodes or between a labeled and an unlabeled node.

The inventive methodology first provides a solution for a graph where nodes have no attributes, only class labels. We then extend the solution to include attributes (and heterogeneous data) by incorporating a conical weighting scheme that weighs the importance of the links relative to the attributes. The construction of the weight matrix assumes binary labeling. However, recursive application of the chosen graph transduction method with reconstruction of the weight matrix will accomplish multi-class classification, as is shown in the experiments on real data in connection with FIGS. 8A and 8B.

When we have a connected unweighted homogeneous/heterogeneous graph that is partially labeled, the goal is to propagate the labels to the unlabeled nodes. In this disclosure, we provide a different perspective on this problem by enabling the effective use of graph transduction techniques. We accomplish this by providing a novel procedure for constructing a weight matrix that serves as input to a rich class of graph transduction techniques. Our procedure has multiple desirable properties. For example, the weights it assigns to edges between unlabeled nodes naturally relate to a measure of association commonly used in statistics, namely the Gamma test statistic. We further demonstrate the efficacy of our approach on synthetic as well as real data, by comparing it with state-of-the-art relational learning algorithms, and with graph transduction techniques using a binary adjacency matrix or a real valued weight matrix computed using available attributes as input. In these experiments we see that our approach consistently outperforms other approaches when the graph is sparsely labeled, and remains competitive with the best when the proportion of known labels increases.

The invention provides a method and system for extending a partially labeled data graph to unlabeled nodes in a single network classification. The invention operates by constructing a weight matrix for data in a single network classification, applying the weight matrix to the data, and then applying a graph transduction method to the weighted data to generate labels for the unlabeled nodes. In one implementation the weight matrix uses a modified graph Laplacian based regularization framework. In one aspect of the method and system, the edges of the data graph are partitioned into categories, weights are assigned to each category, and each edge is assigned the weight of its respective category. In another implementation the categories are edges between nodes with the same label, edges between nodes with opposite labels, edges between unlabeled nodes, edges between an unlabeled node and a node with a label 1, and edges between an unlabeled node and a node with a label −1.

It is also an aspect of the invention to assign weights to edges between unlabeled nodes, where the assigned weight denotes an expectation based on a distribution of edges that have labels. In a variation on this implementation, edges between an unlabeled node and a labeled node are assigned a weight denoting an expectation based on a distribution of edges that have labels, where the distribution is limited to those edges having one node equal to the labeled node. A further variation on this implementation is to assign to each edge a weight that is a conical combination of a weight based on the respective category and a weight based on affinity of attribute values of nodes connected by the edge. In yet another implementation, applying a graph transduction method is accomplished by imposing a tradeoff between a fitting accuracy of a prediction function on labeled data and a smoothness of the prediction function over the graph. It is a further aspect of the invention to estimate the smoothness of the prediction function for the graph Laplacian based regularization framework, and to modify the prediction function to ensure compatibility between the graph transduction method and the graph Laplacian based regularization framework.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:

FIG. 1 is an example input graph (T) to the invention's construction method.

FIG. 2 is a weighted version T_(w) of graph T shown in FIG. 1.

FIG. 3 shows instantiation of graph T_(w) when the labeled edges have only nodes with the same labels.

FIG. 4 shows instantiation of graph T_(w) when the labeled edges have only nodes with different labels.

FIG. 5A represents a relational schema with node types Paper and Author, where the relationship between them is many-to-many; FIG. 5B is the corresponding data graph which shows authors linked to the papers that they authored or co-authored.

FIG. 6A is a set of graphs generated by applying the inventive method to preferential attachment synthetic data where the auto-correlation is high; FIG. 6B is a set of graphs generated by applying the inventive method to preferential attachment synthetic data where the auto-correlation is low.

FIG. 7A is a set of graphs generated by applying the inventive method to forest fire synthetic data where the auto-correlation is high; FIG. 7B is a set of graphs generated by applying the inventive method to forest fire synthetic data where the auto-correlation is low.

FIG. 8A is a set of graphs generated by applying the inventive method to a collection of web pages known as the WEBKB dataset; FIG. 8B is a set of graphs generated by applying the inventive method to a collection of sales information about bread products known as the BREAD dataset.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

The notation used in this disclosure is described in the following table, where graph type “D” is directed and graph type “U” is undirected:

TABLE 1

| Symbol | Graph Type | Semantics |
|---|---|---|
| N_(q) | D and U | Number of nodes with label q |
| N_(qr) | D | Number of edges from node with label q into node with label r |
| N_(qr) | U | When q = r, number of edges between node with label q and node with label r. When q ≠ r, half of the number of edges between node with label q and node with label r |
| N_(p) | D and U | Total number of labeled edges, i.e. edges where both nodes are labeled |
| P_(same) | D and U | Ratio of the number of edges between nodes with the same label to the total number of labeled edges |
| P_(opp) | D and U | Ratio of the number of edges between nodes with different labels to the total number of labeled edges |
| D | D and U | Distribution over labeled edges |

Weight Matrix Construction

In this section we elucidate a way of constructing the weight matrix for a partially labeled graph G(V, E) where V is the set of nodes and E the set of edges. We assume that the labeling is binary, i.e. any labeled node i has a label Y_(i)∈{1,−1}. As mentioned before, the procedure of constructing the weight matrix W, which serves as input to a graph transduction technique, could be applied recursively or iteratively to each (binary) classified portion, to attain multi-class classification. Hence, the input in any run to our weight matrix construction method is a partially (binary) labeled graph as shown in FIG. 1.

Given our setup, a partially labeled graph G has 3 types of nodes and consequently 9 types of edges for a directed graph and 6 types of edges for an undirected one. A node could be labeled 1 or −1 or may be unlabeled. An edge could be between two nodes with the same label (i.e. (1→1) or (−1→−1)) or between two oppositely labeled nodes (i.e. (1→−1) or (−1→1)) or between a labeled and unlabeled node (i.e. (1→?) or (−1→?) or (?→1) or (?→−1)) or between two unlabeled nodes (i.e. (?→?)). An undirected example graph T is shown in FIG. 1. Our task then is to assign weights to each of these types of edges.

Notation

Before we describe the weights we assign to the different types of edges, we introduce some notation. Given a graph G, let N_(q) denote the number of nodes with label q. Let N_(qr) denote the number of edges from node with label q into node with label r. In an undirected graph, this would be the number of edges between nodes labeled q and r, if q=r. If q≠r, then N_(qr) would be half of the number of edges between q and r. Notice that N_(qr) for q≠r could thus be a non-integer value, but we do this to make the formulae in this disclosure consistent irrespective of whether we have a directed or an undirected graph. Let N_(p) denote the total number of labeled edges, i.e. the total number of edges where both nodes are labeled. In other words, N_(p)=N_(11)+N_(−1,1)+N_(1,−1)+N_(−1,−1). With this, let

$\begin{matrix}{{P_{same} = \frac{N_{11} + N_{{- 1},{- 1}}}{N_{p}}},{P_{opp} = \frac{N_{1 - 1} + N_{- 11}}{N_{p}}}} & (1)\end{matrix}$

Hence, P_(same)+P_(opp)=1. We denote this empirical distribution derived from labeled edges by D. A summary of this notation for directed and undirected graphs is shown in Table 1.
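The counts defined above can be tallied directly from an edge list. The following is a minimal sketch in Python and is not part of the original disclosure; the function name `labeled_edge_stats`, the edge-list representation and the dictionary of known labels are all assumptions made for illustration.

```python
from collections import Counter


def labeled_edge_stats(edges, labels, directed=True):
    """Tally N_qr over labeled edges and return (counts, N_p, P_same, P_opp).

    edges  : iterable of (i, j) node pairs (hypothetical representation)
    labels : dict mapping a labeled node to 1 or -1; unlabeled nodes are absent
    """
    counts = Counter()
    for i, j in edges:
        if i not in labels or j not in labels:
            continue  # only edges with both endpoints labeled are counted
        q, r = labels[i], labels[j]
        if directed or q == r:
            counts[(q, r)] += 1.0
        else:
            # Undirected edge between opposite labels: half goes to each
            # ordering, matching the Table 1 convention for q != r.
            counts[(q, r)] += 0.5
            counts[(r, q)] += 0.5
    n_p = sum(counts.values())
    if n_p == 0:
        raise ValueError("no labeled edges: P_same and P_opp are undefined")
    p_same = (counts[(1, 1)] + counts[(-1, -1)]) / n_p
    p_opp = (counts[(1, -1)] + counts[(-1, 1)]) / n_p
    return counts, n_p, p_same, p_opp
```

With the halving convention for undirected graphs, the same expressions for P_(same) and P_(opp) apply to both graph types, which is the consistency the notation is designed to achieve.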

Assignment of Weights

We now describe our weight matrix construction which applies to both directed and undirected graphs. We partition the types of edges into five categories and suggest a way of assigning weights to edges in each of these categories.

- Edges between nodes with the same label: If an edge is between nodes having the same label, that is if node i and node j have the same label, we assign a weight W_(ij)=P_(same) to that edge. This makes intuitive sense since we want to weigh the edge based on how likely it is to have nodes with the same label being connected.
- Edges between nodes with opposite/different labels: If an edge is between nodes with opposite labels, that is if node i and node j have different labels, we assign a weight W_(ij)=−P_(opp) to that edge. This is also intuitive since we want to weigh the edge based on how likely it is to have nodes with opposite labels connected. We assign a negative sign since simply assigning the magnitude will not create a distinction between nodes labeled alike and those with different labels.
- Edges between unlabeled nodes: If an edge is between unlabeled nodes, that is if node i and node j do not have labels, we assign a weight W_(ij)=E_(D) [Y_(i), Y_(j)] to that edge. E_(D) [Y_(i), Y_(j)] denotes the expectation of labeled edges over the distribution D. Y_(i), Y_(j)∈{1, −1} and hence,

$\begin{matrix}\begin{matrix}{{E_{D}\left\lbrack {Y_{i},Y_{j}} \right\rbrack} = {\sum\limits_{q,{r \in {\{{1,{- 1}}\}}}}{{qrP}\left\lbrack {{Y_{i} = q},{Y_{j} = r}} \right\rbrack}}} \\{= {{P\left\lbrack {{Y_{i} = 1},{Y_{j} = 1}} \right\rbrack} - {P\left\lbrack {{Y_{i} = 1},{Y_{j} = {- 1}}} \right\rbrack} +}} \\{{{P\left\lbrack {{Y_{i} = {- 1}},{Y_{j} = {- 1}}} \right\rbrack} - {P\left\lbrack {{Y_{i} = {- 1}},{Y_{j} = 1}} \right\rbrack}}} \\{= {\frac{N_{11}}{N_{P}} - \frac{N_{1 - 1}}{N_{P}} + \frac{N_{{- 1} - 1}}{N_{P}} - \frac{N_{- 11}}{N_{P}}}}\end{matrix} & (2)\end{matrix}$

  Since we do not know the labels of any of the nodes for edges in this category, we assign our most unbiased estimate which is the indicated expected value.
- Edges between an unlabeled node and a node with label 1: If an edge is between an unlabeled node and a node with label 1, we assign a weight W_(ij)=E_(D) [Y_(i)|Y_(j)=1] to that edge. Here Y_(i)∈{1, −1}. In this case,

$\begin{matrix}{{E_{D}\left\lbrack {\left. Y_{i} \middle| Y_{j} \right. = 1} \right\rbrack} = {\frac{N_{11}}{N_{1}} - \frac{N_{- 11} + N_{1 - 1}}{N_{1}}}} & (3)\end{matrix}$

  is our unbiased estimate given that one of the nodes has a label of 1.
- Edges between an unlabeled node and a node with label −1: If an edge is between an unlabeled node and a node with label −1, we assign a weight W_(ij)=E_(D) [Y_(i)|Y_(j)=−1] to that edge. Here Y_(i)∈{1, −1}. In this case,

$\begin{matrix}{{E_{D}\left\lbrack {\left. Y_{i} \middle| Y_{j} \right. = {- 1}} \right\rbrack} = {\frac{N_{{- 1} - 1}}{N_{{- 1}\;}} - \frac{N_{- 11} + N_{1 - 1}}{N_{- 1}}}} & (4)\end{matrix}$

  is our unbiased estimate given that one of the nodes has a label of −1.

A weighted version of our example graph T in FIG. 1 is shown as graph T_(w) in FIG. 2.
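The five categories can be read as a simple lookup on the labels of an edge's two endpoints. The sketch below illustrates this under stated assumptions and is not taken from the original disclosure: the names `edge_weight` and `weight_edges` are hypothetical, unlabeled nodes are represented by `None`, and the two conditional expectations of equations 3 and 4 are passed in as precomputed values (`rho_pos`, `rho_neg`) rather than derived here. The weight for two unlabeled nodes uses the fact that the four terms of equation 2 sum to P_(same)−P_(opp).

```python
def edge_weight(y_i, y_j, p_same, p_opp, rho_pos, rho_neg):
    """Return the weight of one edge from the labels of its endpoints.

    y_i, y_j : 1, -1, or None for an unlabeled node
    rho_pos  : value of equation 3 (neighbor labeled 1), supplied by the caller
    rho_neg  : value of equation 4 (neighbor labeled -1), supplied by the caller
    """
    if y_i is None and y_j is None:
        # Equation 2: the four signed terms reduce to P_same - P_opp.
        return p_same - p_opp
    if y_i is None or y_j is None:
        known = y_j if y_i is None else y_i
        return rho_pos if known == 1 else rho_neg
    # Both endpoints labeled: positive weight if alike, negative otherwise.
    return p_same if y_i == y_j else -p_opp


def weight_edges(edges, labels, p_same, p_opp, rho_pos, rho_neg):
    """Map every edge (i, j) to its weight; nodes missing from labels are unlabeled."""
    return {
        (i, j): edge_weight(labels.get(i), labels.get(j),
                            p_same, p_opp, rho_pos, rho_neg)
        for i, j in edges
    }
```

Keeping the category weights as arguments separates the estimation of the statistics from their assignment, so the same assignment routine serves directed and undirected graphs alike.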

Characteristics of Matrix Construction

In the previous section, we elucidated a way of constructing a weight matrix for a partially labeled graph. In this section, we discuss certain characteristics of this construction. We discuss aspects such as relationships of the suggested weights to standard statistical measures and the tendencies of the weight matrix as a function of the connectivity and labeling in the graph. As we will see, our construction seems to have desirable properties.

Relation to Standard Measures of Association

In the previous section, we described and provided a brief justification of the procedure to assign weights. It turns out that the weights we assign to edges that have at least one unlabeled node, besides being unbiased, have more (statistical) semantics.

Proposition 1. The weights assigned to edges between unlabeled nodes, i.e. E_(D) [Y_(i), Y_(j)], equate to the gamma test statistic (ρ) in the relational setting.

Proof. From equation 2 we have,

$\begin{matrix}{{E_{D}\left\lbrack {Y_{i},Y_{j}} \right\rbrack} = {\frac{N_{11}}{N_{P}} - \frac{N_{1 - 1}}{N_{P}} + \frac{N_{{- 1} - 1}}{N_{P}} - \frac{N_{- 11}}{N_{P}}}} \\{= {{\frac{1}{N_{P}}\left( {N_{11} + N_{{- 1} - 1}} \right)} - {\frac{1}{N_{P\;}}\left( {N_{- 11} + N_{1 - 1}} \right)}}} \\{= {P_{same} - P_{opp}}} \\{= \rho}\end{matrix}$

The gamma test statistic ρ is a standard measure of association used in statistics. The value of this statistic lies in [−1, 1], where positive values indicate agreement, negative values indicate disagreement/inversion and zero indicates absence of association. The statistic was historically used to compare the sorted order of observations based on values of two attributes. Recently, however, it has been suggested as a metric to measure auto-correlation in relational data graphs. Hence, our assignment of weight to edges between unlabeled nodes is the auto-correlation in the graph, which makes intuitive sense.
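As a purely illustrative check of Proposition 1, suppose (hypothetical counts, not taken from any dataset in this disclosure) that a directed graph has N_(11)=5, N_(−1,−1)=3, N_(1,−1)=1 and N_(−1,1)=1, so that N_(p)=10. Then:

$\begin{matrix}{P_{same} = \frac{5 + 3}{10} = 0.8,\quad P_{opp} = \frac{1 + 1}{10} = 0.2,\quad \rho = P_{same} - P_{opp} = 0.6}\end{matrix}$

An edge between two unlabeled nodes in such a graph would therefore receive a weight of 0.6, reflecting fairly strong positive auto-correlation.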

The weights assigned to edges with one labeled and one unlabeled node, i.e. E_(D) [Y_(i)|Y_(j)=1] or E_(D) [Y_(i)|Y_(j)=−1], based on equations 3 and 4 can be written as: (P_(same)|1)−(P_(opp)|1)=ρ_(1) and (P_(same)|−1)−(P_(opp)|−1)=ρ_(−1). These could be considered as gamma test statistics conditioned on one particular type of label and could be referred to as conditional gamma test statistics.

Behavior of Weight Matrix

We now analyze the behavior of the weight matrix as the labeled edges in our input graph tend towards only connecting nodes with the same labels or analogously only connecting nodes with different labels.

As our input graph tends to have only nodes with same labels being connected, it has the following effect on our weight matrix. The weight of edges between nodes with the same label tends to one, i.e. P_(same)→1. The weight of edges between nodes with different labels tends to zero, i.e. −P_(opp)→0. The weight of edges between unlabeled nodes tends to 1, i.e. ρ→1. The weight of the remaining set of edges also tends to one, i.e. ρ_(1), ρ_(−1)→1. Hence, in this situation the weight matrix becomes an adjacency matrix in the extreme case, with different labeled edges vanishing (i.e. being weighted 0) and all other edges getting a weight of one. Consequently, our example weighted graph T_(w) in FIG. 2 becomes graph T_(s) in FIG. 3.

As our input graph tends to have only nodes with different labels being connected, it has the following effect on our weight matrix. The weight of edges between nodes with the same label tends to zero, i.e. P_(same)→0. The weight of edges between nodes with different labels tends to −1, i.e. −P_(opp)→−1. The weight of edges between unlabeled nodes tends to −1, i.e. ρ→−1. The weight of the remaining set of edges also tends to −1, i.e. ρ_(1), ρ_(−1)→−1. Since the graph in the extreme case has no positive weights, the negative sign in the weights is superfluous and can be eliminated. Hence, in this situation too the weight matrix becomes an adjacency matrix in the extreme case, with same labeled edges vanishing (i.e. being weighted 0) and all other edges getting a weight of one. Consequently, our example weighted graph T_(w) in FIG. 2 becomes graph T_(o) in FIG. 4.

We thus have T_(s)∪T_(o)=T, and the labeled edges in T_(s) and T_(o) complement each other on the labeled portion with respect to the base graph T. We intuitively expect the labeled edges between differently labeled nodes to slowly disappear while the other edges remain present, as edges connecting nodes with the same label become predominant. We also expect analogous behavior for the diametric case. As we have seen, these intuitions are captured implicitly in our modeling of the weight matrix, thus making the construction procedure more acceptable.

Extensions

In the previous sections, we described a procedure for constructing the weight matrix for a partially labeled graph with no attributes. In this section, we extend the weighting scheme to include attribute information. Moreover, we also present a solution to handle data heterogeneity using ideas from relational learning.

Modeling with Attributes

For data graphs that have attributes, we want to be able to leverage this information in addition to the information learned from the connectivity of the graph, so as to possibly further improve the performance of our procedure. In particular, we need to extend our weight assignment procedure to be able to encapsulate attribute information. A simple way of combining the already modeled connectivity information with the attributes is to assign a weight to an edge that is a conical combination of the weight based on connectivity and a weight based on the affinity of attribute values of the connected nodes. Hence, if w_(c) is the weight assigned based on the connectivity for the particular edge type and w_(a) is the weight assigned based on attributes, then λw_(c)+μw_(a) is the new weight of that edge, where μ, λ≥0. w_(c) is essentially a weight assignment described above (in the Assignment of Weights subsection), viz. P_(same) or ρ etc. w_(a) is a function of the attributes of the nodes connected by the corresponding edge, which we will soon define. μ and λ are parameters which can be determined through standard model selection techniques such as cross-validation. A reasonable indicator for the value of λ could be the absolute value of the auto-correlation in the graph, while a reasonable estimate of the value of μ could be the absolute value of the cross-correlation between w_(a) and the labeling of the corresponding nodes, i.e. whether the labels are the same or different.

In the absence of attributes, our weight assignment w_(c) for any type of edge has a value in the interval [−1, 1]. To effectively combine the aforementioned two sources of information, w_(a) needs to be of the same scale as w_(c). One obvious choice could be cosine similarity, which is commonly used in text analytics. Cosine similarity lies in [−1, 1], where values close to 1 imply that the nodes are similar while values close to −1 imply that the nodes are dissimilar. Other choices could be kernel functions (K) such as the Gaussian kernel, which normalize popular distance metrics such as Euclidean distance and other l_(p) norms to values in [0, 1]. Here, values close to 1 imply similarity and values close to 0 imply dissimilarity. This range can be easily transformed to our usual range of [−1, 1] with the same semantics as before, by a simple linear transformation of the form 2K−1.
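The conical combination and the 2K−1 rescaling can be sketched as follows. This is only an illustration under stated assumptions: the function names `gaussian_affinity` and `combined_weight` are hypothetical, the kernel is the usual Gaussian form exp(−d²/(2σ²)) on the Euclidean distance d, and the bandwidth `sigma` is a free parameter that the text does not specify.

```python
import math


def gaussian_affinity(x_i, x_j, sigma=1.0):
    """Attribute-based weight w_a in [-1, 1] from a Gaussian kernel K in [0, 1].

    x_i, x_j : equal-length sequences of numeric attribute values
    sigma    : kernel bandwidth (an assumed parameter, chosen by the user)
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    k = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return 2.0 * k - 1.0  # linear map from [0, 1] onto [-1, 1]


def combined_weight(w_c, w_a, lam, mu):
    """Conical combination lam * w_c + mu * w_a with lam, mu >= 0."""
    if lam < 0 or mu < 0:
        raise ValueError("lam and mu must be non-negative")
    return lam * w_c + mu * w_a
```

For example, two nodes with identical attribute vectors get w_(a)=1, so with λ=0.6 and μ=0.4 a connectivity weight of 0.5 would become 0.6·0.5+0.4·1=0.7.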

Modeling with Heterogeneous Data

If the data graph has multiple types of entities, resulting in different types of nodes, the procedure previously described cannot be directly applied to construct the weight matrix. In such cases, standard relational learning strategies such as collapsing portions of the graph and using aggregation can be applied to reduce to a graph with a single type of node with attributes. To this new graph the above extended procedure can be applied.

For instance, in a citation graph we may have authors linked to papers, with papers having multiple authors and vice-versa. An example of this is shown in FIGS. 5A and 5B. In FIG. 5A, we see that the node type Paper 510 has two attributes, Title 515 and Area 516, which denote the title of the paper and the research area it belongs to, respectively. Let the attribute Area 516 be the class label, i.e. we want to classify papers based on their research area. The node type Author 520 has attributes Paper Title 526 and Age 525, which relate a particular paper to the ages of the authors that wrote it. The Title 515 attribute (a primary key) in Paper 510 is the same as the Paper Title 526 attribute (a foreign key) in Author 520. Hence, each Paper 510 node has three attributes, namely Title 515, Area 516 and Age 525. The attributes Title 515 and Area 516 are called intrinsic attributes as they belong to node type Paper 510, and the attribute Age 525 is called a relational attribute since it belongs to a different linked node type Author 520. Each paper can have a variable number of authors and thus each paper would be associated with multiple values of Age 525. A popular solution to this problem is to aggregate the values of the attribute Age 525 of Author 520 into a single value such that each paper is associated with only a single Age 525 value. An aggregation function such as the average over the ages of the related authors for each paper can be used. Now instead of the Age 525 attribute we can introduce a new attribute AvgAge which denotes average age. With this, the attributes of the Paper node are Title, Area and AvgAge. Linking authors that co-authored a paper, we now have a data graph that links only the Paper node type, with each node having two attributes and a class label.
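The aggregation step in this example can be sketched in a few lines. The sketch below is illustrative only: the function name `average_age_per_paper` and the representation of Author records as (paper title, age) pairs are assumptions, chosen to mirror the Paper Title and Age attributes of the Author node type.

```python
from collections import defaultdict


def average_age_per_paper(author_rows):
    """Collapse the relational attribute Age into one AvgAge value per paper.

    author_rows : iterable of (paper_title, age) pairs, one per author record
    Returns a dict mapping each paper title to the mean age of its authors.
    """
    ages_by_paper = defaultdict(list)
    for paper_title, age in author_rows:
        ages_by_paper[paper_title].append(age)
    return {title: sum(ages) / len(ages) for title, ages in ages_by_paper.items()}
```

Other aggregation functions (for instance the minimum, maximum or median) could be substituted for the mean without changing the structure of the resulting single-type graph.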

If we have heterogeneous link types, then the described procedures can be applied independently to graphs formed from each link type and the final result could be obtained by aggregating the individual decisions through standard ensemble label consolidation techniques such as taking a majority vote or a weighted majority based on the corresponding auto-correlations.
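A minimal sketch of such a consolidation step for a single node is given below. It is an illustration rather than a prescribed implementation: the function name `consolidate_label` is hypothetical, the weights are taken to be the absolute auto-correlations of the per-link-type graphs, and returning 0 on a tie is an assumption, since the text does not say how ties are resolved.

```python
def consolidate_label(predictions, autocorrelations=None):
    """Combine per-link-type binary predictions for one node.

    predictions      : list of labels in {1, -1}, one per link type
    autocorrelations : optional list of auto-correlations, one per link type;
                       when omitted, a plain majority vote is taken
    Returns 1 or -1, or 0 when the (weighted) vote is tied (assumed convention).
    """
    if autocorrelations is None:
        weights = [1.0] * len(predictions)
    else:
        if len(autocorrelations) != len(predictions):
            raise ValueError("one auto-correlation per prediction is required")
        weights = [abs(r) for r in autocorrelations]
    score = sum(w * y for w, y in zip(weights, predictions))
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0
```

Weighting by the absolute auto-correlation lets a link type with strong positive or strong negative association count for more than one whose links carry little information about the labels.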

Compatibility with Graph Transduction Techniques

Graph based transductive learning approaches impose a trade off between the fitting accuracy of the prediction function on labeled data and the smoothness of the function over the graph. Typically, the smoothness measure of a prediction function f over the graph G is calculated as:

$\begin{matrix}{{f}_{G}^{2} = {{\sum\limits_{i}{\sum\limits_{j}{W_{ij}{{{f\left( x_{i} \right)} - {f\left( x_{j} \right)}}}^{2}}}} = {\frac{1}{2}{f(X)}^{T}{{Lf}(X)}}}} & (5)\end{matrix}$

where W_(ij) is the weight of the edge between nodes x_(i) and x_(j), X is the input matrix denoting the nodes, f(x_(i)) is the label of node x_(i), f(X)=[f(x_(1)), . . . , f(x_(n))]^(T) if there are n nodes and L is the graph Laplacian of G.

Given the above measure of function smoothness, a graph Laplacian based regularization framework estimates the unknown function f as follows:

$\begin{matrix}{f^{opt} = \arg\min_{f}\; Q(X_{l}, Y_{l}, f) + \eta\,\|f\|_{G}^{2}} & (6)\end{matrix}$

where Q(X_(l), Y_(l), f) is a loss function measuring the accuracy over the labeled set (X_(l), Y_(l)). For example, Q(X_(l), Y_(l), f)=∥f(X_(l))−Y_(l)∥², i.e. squared loss, is a popular choice.

A weight matrix constructed using our method cannot directly be passed as input to this graph regularization framework. This is because the smoothness measure using the graph Laplacian is based on the assumption that connected nodes tend to have the same class labels and hence the weights have to be non-negative (i.e. W_(ij)≥0 ∀i,j). However, it is well-known that edges in relational networks could connect nodes with different labels, which would lead to our construction method assigning negative weights to such edges. An example is the WEBKB dataset, described in Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence, by M. Craven et al., AAAI, pages 509-516 (American Association for Artificial Intelligence, 1998), where student nodes are typically connected to faculty nodes more than other student nodes. To ensure compatibility with the graph Laplacian based regularization framework, we make the following modification:

$\begin{matrix}\begin{matrix}{{f}_{G}^{2} = {\sum\limits_{i}{\sum\limits_{j}{W_{ij}{{{f\left( x_{i} \right)} - {{{sgn}\left( W_{ij} \right)}{f\left( x_{j} \right)}}}}^{2}}}}} \\{= {\frac{1}{2}{f(X)}^{T}{{Mf}(X)}}}\end{matrix} & (7)\end{matrix}$

similar to the one described in the article “Dissimilarity in graph-based semi-supervised classification” by L. Getoor et al. in Artificial Intelligence and Statistics (AISTATS), 2007, where $\tilde{W}_{ij}=|W_{ij}|$, the degree matrix $\tilde{D}=\{\tilde{D}_{ij}\}$ is computed as $\tilde{D}_{ii}=\sum_{j}\tilde{W}_{ij}$, $M=(\tilde{D}-\tilde{W})+(1-{sgn}(W))\circ W$ and the symbol ∘ is the Hadamard product. With this new smoothness measure, we can now pass our constructed weight matrix as input to this rich class of graph transduction methods.
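The effect of the sign term can be seen by evaluating the double sum directly. The sketch below is illustrative and rests on an assumption: each term is scaled by the magnitude |W_(ij)| (the quantity the text denotes with a tilde), so that every term is non-negative. The function names `sign` and `signed_smoothness` are hypothetical, and the matrix form with M is deliberately not used here.

```python
def sign(x):
    """Return 1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def signed_smoothness(W, f):
    """Evaluate sum_i sum_j |W_ij| * (f_i - sgn(W_ij) * f_j)^2.

    W : square matrix (list of lists) of signed edge weights, 0 for no edge
    f : list of predicted values, one per node
    Assumption: each term is scaled by |W_ij| so that it cannot be negative.
    """
    n = len(f)
    total = 0.0
    for i in range(n):
        for j in range(n):
            w = W[i][j]
            if w != 0:
                total += abs(w) * (f[i] - sign(w) * f[j]) ** 2
    return total
```

Under this reading, the measure is zero exactly when every positively weighted edge joins nodes with equal values and every negatively weighted edge joins nodes with opposite values, which is the behavior a negative weight is meant to encourage.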

Experiments

In the previous sections, we described a method to construct a weight matrix for relational data that serves as input to a rich class of graph based transductive learning algorithms. In this section, we assess the efficacy of our approach through empirical studies on synthetic and real data. In these studies, we compare methods across three broad categories, namely: a) sophisticated relational learning (RL) methods, b) sophisticated graph transduction methods with the weight matrix computed using available attributes or adjacency matrix (if no attributes) as input (GTA) and c) relational transductive methods where our learned weight matrix is passed as input to (enhanced/modified) graph transduction techniques. The situations where methods in category c) compare favorably with methods in the other two categories would be the conditions under which use of our procedure would be justified. The relational learning methods we consider are: MLNs, RDNs, PL-EM and RN. The graph transduction methods we consider are: local global consistency (LGC) method (as described in the article “Pseudolikelihood em for within-network relational learning” by R. Xiang et al. in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 1103-1108, published by IEEE Computer Society, Washington, D.C., USA) and harmonic functions Gaussian fields (HFGF) method (as described in the article “Semi-supervised learning using Gaussian fields and harmonic functions” by X. Zhu et al. in Proceedings of ICML, pages 912-919, 2003).

In all of our experiments, we vary the percentage of known labels for training over 5%, 10%, 30% and 70%. The error for each of the methods is obtained by randomly selecting the labeled nodes for the specified proportion 100 times and averaging the corresponding errors. To avoid clutter in the figures reporting the results, we plot only the following four curves (rather than eight):

-   the best performance at each labeled percentage of methods in category a) (BEST RL),
-   the best performance at each labeled percentage of methods in category b) (BEST GTA),
-   the LGC method with our constructed weight matrix as input (LGCW), and
-   the HFGF method with our constructed weight matrix as input (HFGFW), the latter two being the methods in category c).
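The repeated random-labeling protocol described above can be sketched as follows. This is only an illustrative harness: `average_error`, `sweep` and the `predict` callback are hypothetical names, and the sketch assumes that the error is the misclassification rate measured on the nodes whose labels were hidden, which the text does not state explicitly.

```python
import random

# Labeled proportions used in the experiments.
LABEL_FRACTIONS = (0.05, 0.10, 0.30, 0.70)

def average_error(nodes, true_labels, predict, fraction, trials=100, seed=0):
    """Average error over `trials` random choices of the labeled nodes.

    `predict` is any classifier wrapped as a function that takes the dict
    of known labels and returns a dict with a label for every node.
    """
    rng = random.Random(seed)
    n_labeled = max(1, round(fraction * len(nodes)))
    errors = []
    for _ in range(trials):
        labeled = set(rng.sample(nodes, n_labeled))
        known = {v: true_labels[v] for v in labeled}
        predicted = predict(known)
        # Assumption: error is measured on the held-out (unlabeled) nodes.
        held_out = [v for v in nodes if v not in labeled]
        wrong = sum(predicted[v] != true_labels[v] for v in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / len(errors)

def sweep(nodes, true_labels, predict, trials=100):
    """Evaluate one method at every labeled proportion."""
    return {frac: average_error(nodes, true_labels, predict, frac, trials)
            for frac in LABEL_FRACTIONS}
```

Running `sweep` once per method and taking the minimum error within a category at each proportion would give curves of the BEST RL and BEST GTA kind.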

Synthetic Experiments

We generate graphs using well accepted random graph generation procedures that mimic real-world graphs, namely: forest fire (as described in the article “Graph evolution: Densification and shrinking diameters” by J. Leskovec et al. in ACM Trans. Knowl. Discov. Data, 1(1):2, 2007), and preferential attachment (as described in the article “Emergence of scaling in random networks” by A. Barabasi et al. in Science, 286:509-512, 1999). These procedures add one node at a time and, as each node is added, we assign it a label based on an intuitive label generation procedure which is described below.

Setup

We generate graphs consisting of 1000 nodes for the two generation techniques mentioned above. The parameter settings for forest fire (forward probability=0.37, backward probability=0.32) and preferential attachment (exponent β=1.6) are derived from the above cited articles which indicate that these settings lead to the most realistic graphs.

On the labeling front, we generate a binary labeling ∈ {1, −1} by a simple procedure for each of these graphs. Whenever a new node is added, with probability p we assign the majority class amongst its labeled neighbors and with probability 1−p we assign one of the two labels uniformly at random. Hence, the labels generated are dependent on the particular graph generation procedure and consequently the connectivity of the graph, as is desired. It is easy to see that as p→1 the auto-correlation in the graph increases, leading to more homogeneity or less entropy amongst connected nodes. For each of the two graph generation procedures, we create graphs where p is low (i.e. 0.3) and where p is high (i.e. 0.8). The low p leads to an auto-correlation of about 0.2 (i.e. ρ≈0.2) while the high p leads to an auto-correlation of about 0.7 (i.e. ρ≈0.7), both of which are calculated from the generated graphs.
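The label generation procedure can be sketched as below. The function name `generate_labels` and the `arrivals` input format are hypothetical, and the graph generator itself (forest fire or preferential attachment) is left out. Two details are not specified in the text and are assumptions here: a node with no labeled neighbors, or with a tied vote among its neighbors, receives a uniformly random label.

```python
import random

def generate_labels(arrivals, p, seed=0):
    """Assign labels in {1, -1} to nodes in their order of arrival.

    `arrivals` is a list of (node, neighbors) pairs, where `neighbors`
    are the nodes the new node links to when it is added.
    """
    rng = random.Random(seed)
    labels = {}
    for node, neighbors in arrivals:
        # Sum of neighbor labels: positive means +1 is the majority class.
        votes = sum(labels[n] for n in neighbors if n in labels)
        if votes != 0 and rng.random() < p:
            labels[node] = 1 if votes > 0 else -1
        else:
            # Probability 1 - p, or (assumption) no majority to copy.
            labels[node] = rng.choice((1, -1))
    return labels

# Toy arrival sequence: each new node links to up to two earlier nodes.
arrivals = [(0, []), (1, [0]), (2, [0, 1]), (3, [1, 2]), (4, [2, 3])]
low = generate_labels(arrivals, p=0.3)
high = generate_labels(arrivals, p=0.8)
```

With a larger p, more nodes copy their neighborhood majority, which is what drives the auto-correlation up.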

Observations

From FIGS. 6A, 6B, 7A and 7B we see that given a particular graph generation procedure, irrespective of the level of auto-correlation, the relative performance of the three different classes of methods is qualitatively similar. GTAs are known to perform particularly well when only a few nodes are labeled and this is confirmed in our experiments. As the percentage of known labels increases, however, the relational learning methods start performing better than standard graph transduction techniques. This is probably because most sophisticated relational learning methods have low bias and relatively high variance; as the number of labeled nodes increases, this variance drops rapidly.

The interesting result, however, is that our weight matrix construction technique seems to capture enough of the complexity of the labeling and the network structure that besides performing exceedingly well when the graph is sparsely labeled, it remains competitive with relational learning methods when the percentage of known labels is moderate to high.

Real Data Experiments

For experiments on real data we choose two datasets, namely: WEBKB and a real industrial dataset, BREAD, obtained from a large consumer retail company.

Setup

The WEBKB dataset has a collection of webpages obtained from computer science departments of four US universities. Each webpage belongs to one of seven categories, namely: course, faculty, student, staff, project, department or other. The “other” category webpages were not used as input in the classification task, but were used to link webpages in the remaining six classes as described in the article “Classification in networked data: A toolkit and a univariate case study” by S. Macskassy et al. in J. Mach. Learn. Res., 8:935-983, 2007. We performed experiments on the four graphs formed, one for each university, and computed the average error over the four universities for each of the learning methods.

The BREAD dataset has sales information about bread products sold in different stores in the northeastern United States. The dataset has information from 2347 stores. For each store we know its location, whether the store met or underachieved its target quarterly sales, the amounts it had on promotion during that period, the quantity ordered during that period and the amount reclaimed during that period. Based on location, we can form a graph linking the closest stores together. With this, we have a dataset of size 2347 where each node in the graph has four attributes. Setting the attribute indicating whether the sales met or underachieved the expected amount as our class label, we obtain a graph where each node has three explanatory attributes.
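One plausible way to link the closest stores is a k-nearest-neighbor graph over the store locations, sketched below. The text does not say how many neighbors were used or which distance was applied, so the value of `k`, the Euclidean distance on planar coordinates and the function name `knn_adjacency` are all assumptions made purely for illustration.

```python
import numpy as np

def knn_adjacency(coords, k=3):
    """Symmetric 0/1 adjacency linking each point to its k nearest points.

    `coords` is an (n, 2) array of planar store coordinates (assumed).
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)             # a store is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest stores
    A = np.zeros((n, n), dtype=int)
    A[np.repeat(np.arange(n), k), nearest.ravel()] = 1
    return np.maximum(A, A.T)                  # keep an edge if either end chose it

# Four stores on a line: two close pairs that are far from each other.
coords = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
A = knn_adjacency(coords, k=1)
```

The resulting adjacency matrix, together with the three explanatory attributes per store, would then be the input to the weight matrix construction.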

Observations

On the WEBKB dataset we see in FIG. 8A that the best GTA is better than the relational methods when a small percentage (<20%) of labels is known, but the relational methods quickly close this gap and start outperforming the GTAs with more label information. Our weight matrix construction method, however, performs better than the other two classes of methods at low label proportions and remains competitive with the relational methods as this proportion increases, unlike the GTAs. This favorable behavior can most likely be attributed to our method being able to effectively model the strength (i.e. the numerical value) and direction (i.e. + or −) of dependencies between linked entities, something GTAs seemingly fail to capture.

On the BREAD dataset we see in FIG. 8B that the GTAs are much worse than the other classes of methods. A possible reason for this is that stores near to one another typically compete with each other for the same type of products and hence, our input graph exhibits strong negative auto-correlation. Since GTAs predominantly model similarity between linked entities, their performance is practically unchanged even when the percentage of known labels is increased. The relational methods perform much better than GTAs in this setting. In contrast to GTAs, they effectively capture the dissimilarity between linked nodes as the number of known labels increases. However, our weight matrix construction method seems to capture this relationship much earlier, with only a small percentage of labels known.

Discussion

In this disclosure, we have provided a simple yet novel way of constructing a weight matrix for partially labeled relational graphs that may be directed or undirected, that may or may not have attributes and that may be homogeneous or heterogeneous. We have described the manner in which such a weight matrix can serve as input to a rich class of graph transduction methods through a modified graph Laplacian based regularization framework. We have described the desirable properties of this construction method and showcased its effectiveness in capturing complex dependencies through experiments on synthetic and real data.

In the future, it would be interesting to extend this procedure to perform multi-class classification in a single shot, rather than having to perform multiple binary classification tasks. This would most likely improve the actual running time, though not necessarily the time complexity in terms of O(·). On the theory side, it might be of some interest to analyze the synthetic label generation procedure introduced in this disclosure, for different types of graphs. One could use ideas from the theory of random walks to determine tendencies of the label generation procedure. From a learning theory perspective, one could potentially derive error bounds as functions of p (amongst other parameters), and if one were to express p in terms of the auto-correlation ρ, one would have error bounds as functions of ρ. This would be of some interest since ρ can be computed from static graphs or from a snapshot of an evolving graph, where one does not have to know the order in which the nodes were attached, thus making the error bound applicable to graphs in a larger set of applications.

While the invention has been described in terms of a single preferred embodiment, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.

1. A method for extending a partially labeled data graph to unlabeled nodes in a single network classification, comprising: constructing a weight matrix for data in a single network classification, the weight matrix incorporating a conical weighting scheme that weighs importance of links relative to attributes; applying the weight matrix to the data; and applying a graph transduction method to the weighted data to generate labels for the unlabeled nodes.
2. A method as in claim 1, wherein the weight matrix uses a modified graph Laplacian based regularization framework.
3. A method as in claim 2, further comprising: partitioning edges of the data graph into categories; assigning a weight to each category; and assigning to each edge the weight of its respective category.
4. A method as in claim 3, wherein the categories are edges between nodes with the same label; edges between nodes with opposite labels; edges between unlabeled nodes; edges between an unlabeled node and a node with a label 1; and edges between an unlabeled node and a node with a label −1.
5. A method as in claim 4, wherein edges between unlabeled nodes are assigned a weight denoting an expectation based on a distribution of edges that have labels.
6. A method as in claim 4, wherein edges between an unlabeled node and a labeled node are assigned a weight denoting an expectation based on a distribution of edges that have labels, said distribution being limited to those edges having one node equal to the labeled node.
7. A method as in claim 3, further comprising assigning to each edge a weight that is a conical combination of a weight based on the respective category and a weight based on affinity of attribute values of nodes connected by said edge.
8. A method as in claim 1, wherein applying a graph transduction method further comprises imposing a tradeoff between a fitting accuracy of a prediction function on labeled data and a smoothness of the prediction function over the graph.
9. A method as in claim 8, further comprising estimating the smoothness of the prediction function for the graph Laplacian based regularization framework; and modifying the prediction function to ensure compatibility between the graph transduction method and the graph Laplacian based regularization framework.
10. A system for extending a partially labeled data graph to unlabeled nodes in a single network classification, comprising: a weight matrix for data in a single network classification, the weight matrix incorporating a conical weighting scheme that weighs importance of links relative to attributes; means for applying the weight matrix to the data; and a graph transduction method applied to the weighted data to generate labels for the unlabeled nodes.
11. A system as in claim 10, wherein the weight matrix uses a modified graph Laplacian based regularization framework.
12. A system as in claim 11, further comprising: means for partitioning edges of the data graph into categories; means for assigning a weight to each category; and means for assigning to each edge the weight of its respective category.
13. A system as in claim 12, wherein the categories are edges between nodes with the same label; edges between nodes with opposite labels; edges between unlabeled nodes; edges between an unlabeled node and a node with a label 1; and edges between an unlabeled node and a node with a label −1.
14. A system as in claim 13, wherein edges between unlabeled nodes are assigned a weight denoting an expectation based on a distribution of edges that have labels.
15. A system as in claim 13, wherein edges between an unlabeled node and a labeled node are assigned a weight denoting an expectation based on a distribution of edges that have labels, said distribution being limited to those edges having one node equal to the labeled node.
16. A system as in claim 12, further comprising assigning to each edge a weight that is a conical combination of a weight based on the respective category and a weight based on affinity of attribute values of nodes connected by said edge.
17. A system as in claim 10, wherein a graph transduction method is applied by imposing a tradeoff between a fitting accuracy of a prediction function on labeled data and a smoothness of the prediction function over the graph.
18. A system as in claim 17, further comprising means for estimating the smoothness of the prediction function for the graph Laplacian based regularization framework; and means for modifying the prediction function to ensure compatibility between the graph transduction method and the graph Laplacian based regularization framework.
19. A computer implemented system for extending a partially labeled data graph to unlabeled nodes in a single network classification, comprising: a computer processor for executing computer code; first computer code for constructing a weight matrix for data in a single network classification, the weight matrix incorporating a conical weighting scheme that weighs importance of links relative to attributes; second computer code for applying the weight matrix to the data; and third computer code for applying a graph transduction method to the weighted data to generate labels for the unlabeled nodes.
20. A computer implemented system as in claim 19, wherein the weight matrix uses a modified graph Laplacian based regularization framework.